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Abstract 

We present a simple deep learning framework to simultaneously predict keypoint 
locations and their respective visibilities and use those to achieve state-of-the-art perfor¬ 
mance for fine-grained classification. We show that by conditioning the predictions on 
object proposals with sufficient image support, our method can do well without com¬ 
plicated spatial reasoning. Instead, inference methods with robustness to outliers, yield 
state-of-the-art for keypoint localization. We demonstrate the effectiveness of our accu¬ 
rate keypoint localization and visibility prediction on the fine-grained bird recognition 
task with and without ground truth bird bounding boxes, and outperform existing state- 
of-the-art methods by over 2%. 


Introduction 


Fine-grained image categorization is the task of accurately separating categories where the 
distinguishing features may be as minute as a different fur pattern, shorter horns, or a smaller 
beak. The widely accepted and popular approach of dealing with such a task is intuitive: 
align analogous regions and hone in on where you expect the differences to be. The analo¬ 
gous regions are usually defined by keypoints. Therefore, to perform well one would require 
not only accurate object-level localization, but also part and/or keypoint localization. 

For keypoint localization, the most common approach is to learn a set of keypoints de¬ 
tectors to model appearance and an associated spatial model [□, 123, □, O] to capture their 
spatial relations. Keypoint detectors generate a set of candidates and a spatial model is used 
to infer the most likely configuration. Keypoint detectors typically model local appearance 
and thus an approach has to rely on expressive spatial models to capture long range de¬ 
pendencies. Alternatively, the keypoint detectors could condition their predictions on larger 
spatial support and jointly predict several keypoints [□], then the need for expressive spatial 
models could be eliminated leading to simpler models. 
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For effective fine-grained category detection, the keypoint localization method must have 
high accuracy, low false positive rates, and low false negative rates. Missed or poorly local¬ 
ized predictions make it impossible to extract the relevant features for the task at hand. If 
a keypoint is falsely determined to be present within a region, it is hard to guarantee that it 
will appear at a reasonable location. In the case of localizing keypoint-defined regions of 
an image, such as head or torso of a bird, a single outlier in the keypoint predictions can 
significantly distort the predicted area. This specific case is noteworthy as several of the cur¬ 
rent best-performing methods on the CUB 200-2011 birds dataset [ED] rely on deep-network 
based features extracted from localized part regions [□, □, E2, El]. 

In this work, we tackle the problem of learning a keypoint localization model that re¬ 
lies on larger spatial support to jointly localize several keypoints and predict their respective 
visibilities. Leveraging recent developments in Convolutional Neural Networks (CNNs), we 
introduce a framework that outperforms the state-of-the-art on the CUB dataset. Further, 
while CNN-based methods suffer from a loss of image resolution due to the fixed-sized in¬ 
puts of the networks, we introduce a simple sampling scheme that allows us to work around 
the issue without the need to train cascades of coarse-to-fine localization networks [123, IZ3]. 
Finally, we test our predicted keypoints on the fine-grained recognition task. Our keypoint 
predictions are able to significantly boost the performance of current top-performing meth¬ 
ods on the CUB dataset. Our major contributions include: 

1. State-of-the-art keypoint and region (head, torso, body) localization with visibility 
prediction using a single neural network based on the AlexNet [I2D] architecture. 

2. A sampling scheme to significantly improve keypoint prediction performance without 
the use of cascades of coarse-to-fine localization networks [ED, E3]. 

3. Improvement of the state-of-the-art performance of [El] on the CUB classification 
task by using our predicted keypoints with significant gains when the groundtruth bird 
bounding box is not provided during test time. 


2 Related Work 

Fine Grained Recognition: Prior work focuses on localizing informative parts of objects 
and then extracting features from them for classification. Using pairs of localized keypoints, 
Berg et al. [□] learn a set of highly discriminative features for fine-grained classification. 
Farrell et al. [□] and Branson et al. [Q] use pose normalized representations of birds and 
their regions (head, torso, entire bird) followed by feature extraction for classification. Liu et 
al. [E3] extend the exemplar based model of [ID] with pose information for keypoint localiza¬ 
tion and subsequent classification of birds. Based on the very successful framework of the 
RCNN [03], Zhang et al. [El] perform bird classification using three localized bird regions: 
head, torso, and full body. 

The above mentioned methods are highly dependent on accurate keypoint and bird region 
localization. In fact, [□, O] rely on the groundtruth bird bounding box at test time to localize 
keypoints and to perform classification. Our method overcomes this bottleneck of localiza¬ 
tion and we demonstrate state-of-the-art classification performance using the framework of 
[ED] along with our localized regions. 

Object Region Proposals: Region proposals combined with deep network systems are an 
efficient solution for finding objects in an image. Recent works use region proposals as initial 
object candidates to either reduce their search space [O, E3] or to refine their localization 
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\m- Instead of exhaustively sliding a window feature extractor on an image at all locations, 
scales, and aspect ratios, region proposal methods are used to quickly identify a smaller 
and manageable set of image regions which have high recall for objects present in the image. 
The time saved enables the use of more expensive feature extraction and processing. Popular 
region proposal methods include [0, 0, O, E3, E3]. In our work, we use Edge Boxes [E3] for 
its fast computational speed and its effective scoring method that allows us to further reduce 
the number of candidates needed test time. 

Pose Detection & Regression with Deep Networks: Our method for keypoint localization 
mainly draws inspiration from the use of regression in networks in the MultiBox approach 
by Erhan et al. [O]. The authors train a deep network which regresses a small number of 
bounding boxes (~ 100) as object bounding box proposals, along with a confidence value 
for each bounding box. 

Regression for localization of keypoints has previously been explored by Toshev et 
al. [EO]. They use a cascade of deep network based regressors for human pose estimation 
to refine the keypoint predictions. At each stage, the network uses a region around the pre¬ 
vious prediction to acquire higher resolution inputs and solve the fixed-resolution network 
input issue. In contrast, our work relies on multiple regions sampled with Edge Boxes from 
the image and simultaneously predicts all keypoints. Varying sized regions provide varying 
resolution and context, and we achieve more robust predictions from multiple regions with 
statistical outlier removal. 

One of the closest works to ours on the CUB dataset is that of Liu et al. [IZ3, E3]. They 
achieve remarkable performance on both keypoint localization and visibility prediction using 
ensembles of pose exemplars and part-pair detectors. We compare our performance with 
theirs using metrics defined in their work. 

In contemporary works, the Deep LAC model [IZD] bridges a localization regression net¬ 
work and classification network to train simultaneously to perform on similar tasks to our 
own. While their setup is very similar to our own, they directly target the localization of en¬ 
tire head and torso boxes whereas we target the keypoints that define said boxes. We include 
their accuracies for comparison in the localization and recognition experiments. 


3 Method 

We design our model to simultaneously predict keypoint locations and their visibilities for 
a given image patch. To share the information across categories, our model is trained in a 
category agnostic manner. At test time, we efficiently sample each image with Edge Boxes, 
make predictions from each Edge Box, and reach a consensus by thresholding for visibility 
and reporting the medoid. 

3.1 Training Convolutional Neural Networks for Keypoint Regression 

Our network is based on AlexNet m, but modified to simultaneously predict all keypoint 
locations and their visibilities for any given image patch. AlexNet is an architecture with 
5 convolutional layers and 3 fully connected layers. Henceforth, we refer to the 3 fully 
connected layers as fc6, fc7, and fc8. We replace the final fc8 layer with two separate output 
layers for keypoint localization and visibility respectively. Our network is trained on Edge 
Box [E3] crops extracted from each image and is initialized with a pre-trained AlexNet [120] 
trained on the ImageNet [0] dataset. Each Edge Box is warped to 227x227 pixels before it 
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can be fed through the network. We apply padding to each Edge Box such that the warped 
227 x227 pixel crop has 16 pixels of buffer in each direction. 

Given N keypoints of interest, we train a network to output an N dimensional vector v 
and a 2N dimensional vector l corresponding to the visibility and location estimates of each 
of the keypoints ku i G {1,TV}, respectively. The corresponding groundtruth targets during 
training are v and /. We define v to consist of indicator variables v; G {0,1} such that v* = 1 
if keypoint h; is visible in the given Edge Box image before padding is performed, and 0 
otherwise. The groundtruth location vector l is of length 2N and consists of pairs (. l Xi ,l yi ) 
which are the normalized (jt,y) coordinates of keypoint ki with respect to the un-padded Edge 
Box image. Output predicted from the network, v; G [0,1], acts as a measure of confidence 
of keypoint visibility, and 2D locations predicted by the network are denoted by 

We use the Caffe framework [D33] for training our deep networks. To train a network 
optimized for both tasks simultaneously, we define our losses as follows: 


Cvis = I|v- v\\j and C ioc = £ \’t ■ [(l Xl - t Xi f + (l yt - l yi ) 2 ] 

i= 1 

£net = t-'vis T~ £loc 


( 1 ) 

( 2 ) 


The visibility loss C v i s is the squared Euclidean distance between the ground truth vis¬ 
ibility label vector v, and the predicted visibility vector v. The values in our v’s always lie 
between 0 and 1 as they are obtained after squashing network outputs with a sigmoid func¬ 
tion. The keypoint localization loss Ci oc is a modified Euclidean loss, in which we set the 
loss between the prediction and the target to be 0 if v* = 0 i.e. if the keypoint ki is absent in 
the given image. The final training loss (C net ) is given by the sum of the two losses. 

To construct our training set for predicting keypoint visibility and locations, we extract 
up to 3000 Edge Boxes per image. To train a robust predictor, we need a collection of 
training images with high variability in which different subsets of keypoints are visible. We 
generate examples that satisfy this criteria by retaining a subset of Edge Boxes which have 
at least 50% of their area contained inside the groundtruth bounding box and have at least 
20% intersection over union overlap (IOU) with the groundtruth bounding box. We also 
included up to 50 random boxes per image from outside the bounding box as negative back¬ 
ground examples. We augment our dataset with left/right flips. After flipping, appropriate 
changes were applied to the label vectors. This consisted of swapping orientation-sensitive 
keypoints such as “left eye” and “left wing” with “right eye” and “right wing”, and updating 
their respective coordinates and visibility indicators. We first train our model on 25 images 
per class and tune algorithmic and learning rate parameters on a held-out validation set com¬ 
prising the remaining 4-5 images per class. Finally, we re-train using the entire training set 
before running our model on the test set. 


3.2 Combining Multiple Keypoint Predictions 

Our algorithm for dealing with predictions from multiple Edge Boxes at test time is illus¬ 
trated in Fig. 1. Due to the variance from making predictions from multiple unique subcrops 
of the image, we need to form a consensus from the multiple predictions. In our experiments, 
we found that after removing predictions with low visibility confidences, the remaining pre¬ 
dictions had a peaky distribution around the ground truth. We use medoid as a robust estima¬ 
tor for this peak and found it to be effective in most cases (Fig. 5). For the task of localizing 
part regions around keypoints (described in section 3.3), we found on our train/val split that 
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Input Image Edge Boxes and associated Outlier Removal and Consensus 

predictions for Right-Eye keypoint 


Figure 1: The pipeline of our keypoint localization process: Given an input image, we extract multiple 
edge boxes. Using each edge box, we make predictions for the location of each of the 15 keypoints, 
along with their visibility confidences. We then find the best predicted location by performing confi¬ 
dence thresholding and finding the medoid. The process is illustrated for the right eye keypoint (Black 
edge boxes without associated dots make predictions with confidences below the set threshold, and 
green is an outlier with a high confidence score). 


we achieved better localization performance if we kept a set of good predictions (referred to 
as inliers) instead of using only the medoid. We now describe our procedure for obtaining 
a tight set of inliers and our choice of parameters. For the keypoint prediction task, we only 
use the visibility thresholds and report the medoid. 

Case 1: Ground Truth Object Box Given: We first describe our method in the case that 
the ground truth object boxes are given. Using the ground truth object box, we retain the 
generated Edge Boxes that are mostly contained within and have an IOU of at least 0.2. This 
results in roughly 50-200 remaining Edge Box subcrops per image. Each subcrop is then 
independently fed through our keypoint prediction network, returning a set of normalized 
keypoint predictions and visibilities. 

Because each subcrop is expected to cover less than the whole object and contain only 
a subset of the keypoint predictions, we drop any prediction if its corresponding visibility is 
below 0.6. Because we make use of multiple overlapping subcrops, it is very likely that at 
least one of them will lead to a prediction with a sufficiently high visibility score, thereby 
allowing us to be much more aggressive with the false positive filtering. 

Given multiple remaining keypoint predictions per keypoint with sufficiently high vis¬ 
ibility scores, we then proceed to remove outliers. To do so, we threshold on a modified 
Z-score based on a description given by Iglewicz and Hoaglin [HE]. The modified Z-score is 
one that is re-defined using medoids and medians in place of means, as the former estimates 
are more robust to outliers. 

Let pi where i = 1, • • • , M be the set of M surviving un-normalized keypoint predictions 
(for a given keypoint) in (x,y) image coordinates. We first define p to be the medoid predic¬ 
tion such that: 

M 

p = argmin L\\Pj~ A'lk> (3) 

Pj i= 1 

In other words, p is the prediction such that its Euclidean distance from all other pre¬ 
dictions for that keypoint is minimal. While this optimization is costly at a large scale, we 
typically deal with only 10-20 predictions at a time after thresholding for visibility scores. 
To compute the modified Z-score we use: 


A \ \Pi~P\\2 
median (| |/? e - — ^|| 2 ) ’ 




(4) 


Here, the denominator is the median absolute deviation, or simply the median distance 
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from the medoid p. We use the recommended A = 0.6745. The above procedure is separately 
computed for all 15 sets of keypoint prediction candidates. Finally, we drop any keypoint 
prediction with Z* > 0.35, a threshold that was experimentally determined on the held-out set. 

Case 2: Ground Truth Object Box Not Given: Our ground truth object box not given 
scenario requires little change from the above case. Using the Edge Box ranking, we found 
that most of our “good” Edge Boxes fell within the top 600 Edge Boxes per image, saving us 
a lot of computation. Tuning parameters on our train/val split, we found that an even more 
aggressive visibility threshold of 0.94 and a Z-score threshold of 0.3 gave the best results. 

Medoid-Shift: While the simple Z-score thresholding combined with the medoid achieves 
excellent results, as we will demonstrate in the results section, we were able to further im¬ 
prove our results by using medoid-shifts [El]. We use the medoid of the largest output cluster 
from the algorithm instead of the medoid computed over all the visibility-filtered predictions. 


3.3 Bird Classification 

We verify the effectiveness of our localized parts by implementing the simple classification 
framework as described in [ED]. Using the keypoints, three regions are identified from each 
bird: head, torso, and whole body. The head is defined as the tightest box surround the beak, 
crown, forehead, eyes, nape, and throat. Similarly, the torso is the box around the back, 
breast, wings, tail, throat, belly, and legs. The whole body bounding box is the object bound¬ 
ing box provided in the annotations. To perform classification, fc6 features are extracted 
from these localized regions, concatenated into a feature vector of length 4096 x 3, and used 
for 200-way linear 1-vs-all SVM classification. 

To handle the case when ground truth bounding box is not given at test time, we use an 
overlap heuristic based on the predicted head and torso boxes. We first start by finding the 
tightest box around the predicted head and torso boxes. While this initial box will do well 
for birds in their canonical poses, it will result in an undersized box in many cases because 
the keypoints do not always capture the full extent of the bird. We then assume that there 
exists an Edge Box with a high edge score that better captures the whole bird. To let the box 
expand to capture more of the object, we first identify the Edge Boxes such that the tightest 
box is at least 90% contained within and has at least 50% IOU overlap. The final whole body 
bounding box is the Edge Box that passes both criteria that also has the highest Edge Box 
object score. If no Edge Box passes the overlap test, we fall back to the starting tightest box. 


4 Experiments and Results 

We evaluate our prediction model on the challenging Caltech-UCSD Birds dataset [E9]. This 
dataset contains 200 bird categories with 15 keypoint location and visibility labels for each 
of the total of 11788 images. We first evaluate our keypoint localization and visibility pre¬ 
dictions against other top-performing methods. Next, we demonstrate their effectiveness in 
the fine-grained categorization task by significantly improving state-of-the-art through better 
localization. 
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Na 

Ta 
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Total 

m 

62.1 

49.0 

69.0 

67.0 

72.9 

58.5 

55.7 

40.7 

71.6 

70.8 

40.2 

70.8 

59.7 

m 

64.5 

61.2 

71.7 

70.5 

76.8 

72.0 

70.0 

45.0 

74.4 

79.3 

46.2 

80.0 

66.7 

Ours 

74.9 

51.8 

81.8 

77.8 

77.7 

67.5 

61.3 

52.9 

81.3 

76.1 

59.2 

78.7 

69.1 


Table 2: Comparison of per-part PCP with Liu et al. [E2, □]. The abbreviated part names from left 
to right stand for back, beak, belly, breast, crown, forehead, eye, leg, wing, nape, tail, and throat. 


4.1 Keypoint Localization and Visibility Prediction 


Table 1 reports our keypoint and visibility performance without using any ground truth 
bounding box information. Our medoid method reports the medoid of predictions above 
a visibility threshold, as seen in the red star in Fig. 5. Our “mdshift” method reports the new 
medoid computed using medoid-shift, which is the blue circle in Fig. 5. We used the eval¬ 
uation code provided by the authors of [IZ3] to measure our performance using the metrics 
defined in their work. In short, PCP (Percent Correct Parts) is the percentage of keypoints 
localized within 1.5 times the annotator standard deviation. We received the pre-computed 
standard deviatons and evaluation code from the authors of [IZ2] to avoid any discrepancies 
during evaluation. AE (Average Error) is the mean euclidean prediction error, capped at 5 
pixels, computed across examples where a prediction was made and a ground truth location 
exists. FVR and FIR refer to False Visibility Rate and False Invisibility Rate respectively. 
The additional methods for comparison are the same as listed in their paper. 

Compared to the top-performing meth¬ 
ods that also predict visibility, our method 
achieves the best numbers in three out 
of four metrics. Our PCP and AE met¬ 
rics outperform other methods in the table, 
with our medoid-shift variant performing 
slightly better. Our FIR is higher because 
we are using the visibility threshold tuned 
on the part-localization task. A slightly 
lowered threshold would lower the FIR and 
raise the FVR without significantly affect¬ 
ing the PCP. 


Method 

PCP 

AE 

FVR 

FIR 

Poselets [i] 

24.47 

2.89 

47.9 

17.15 

Consensus [□] 

48.70 

2.13 

43.9 

6.72 

Exemplar [E3] 

59.74 

1.80 

28.48 

4.52 

Ours (medoid) 

68.7 

1.4 

17.1 

5.2 

Ours (mdshift) 

69.1 

1.39 

17.1 

5.2 

Human [123] 

84.72 

1.00 

20.72 

6.03 


Table 1: Localization and Visibility Prediction 
Performance of various methods without using the 
ground truth Bounding Box 


The highest reported PCP is 66.7% due to Liu et al. m, which also predicts visibilities 
but did not report them. We compare against their PCP in Table 2. Because our method 
differs significantly from theirs, we outperform them in only 7 of the listed part categories 
despite having a better overall PCP, suggesting further improvements by targeting the differ¬ 
ences in our models’ behaviors. 


4.2 Head and Torso Localization 

We evaluate our ability to localize keypoint-defined part-regions on the test set. Example 
predictions can be seen in Fig. 3. In Table 3, we compare our part-localization accuracies 
with other methods. We also compare with the simple case where we make predictions by 
feeding just the ground truth boxes through the network (single GT Bbox). (We also tried 
re-training a network on just GT bounding boxes for this case, but it didn’t perform as well.) 
Unlike the keypoint prediction task, we retain a set of inliers after Z-score thresholding to 
determine the extent of each part box. This was determined to perform best on our validation 
set. The reported metrics are the percentage of heads, torsos, and whole body boxes that 
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Method 

Head 

Torso 

Whole Body 


Part-Based RCNN [El] 

68.2 

79.8 

N/A 


Deep LAC [CD] 

74.0 

96.0 

N/A 

GT Bbox 

Ours (single GT bbox) 

75.6 

90.2 

N/A 


Ours (multiple) 

88.8 

93.9 

N/A 


Ours (multiple, mdshift) 

88.9 

94.3 

N/A 


Part-Based RCNN [E31] 

61.4 

70.7 

88.3 


Exemplar [□] 

79.9 

78.3 

N/A 

No GT Bbox 

Ours (multiple) 

87.8 

89.0 

84.5 


Ours (multiple, mdshift) 

88.0 

88.7 

84.6 


Table 3: Comparison of Part Localization Performance: Our method based on keypoint prediction 
from Edge Boxes shows significant improvement over previous work. 

were correctly localized with a >50% IOU. The ability to perform competitively on this task 
should correlate with a high PCP and low FVR and FIR. 

The results in Table 3 demonstrate that 
our keypoint predictions are useful in gen¬ 
erating accurate part boxes. Our lower per¬ 
forming single GT Bbox method suggests 
that our use of multiple predictions from 
Edge Boxes allows for more accurate pre¬ 
dictions. Further, we also computed head 
and torso boxes using the keypoint predic¬ 
tions from Liu et al. m as shown in the 
“Exemplar” row. Based on their accuracy, 
their boxes should also be able to improve 
the results of [El]. 

In Fig. 2, we also look at how our lo¬ 
calization ability is affected by the number 
of top Edge Boxes sampled from the image. 

As we previously noted, the Edge Box edge 
scoring is effective enough that most of the 
sufficiently well localized boxes we used in the ground-truth bounding box given case fell 
within the top 600. However, as our model predicts individual keypoints and visibilities, it 
does not need a well localized box at test time at all. It merely needs a set of Edge Boxes that, 
combined, provide enough coverage over the actual bird for it to predict keypoint locations 
and visibilities. As such, our model is able to continue to localize over 70% of the head and 
torso boxes with at least 50% IOU as the number of Edge Boxes drops to 50. While the 50% 
IOU recall of Edge Boxes for head and torsos on the validation set were 66.36% and 95.12% 
at 600 boxes and 17.28% and 62.20% respectively at 50 boxes, we demonstrate that we were 
able to localize these parts with higher accuracy than would have been achievable had we 
used an RCNN-based approach and tried to map Edge Boxes to the parts. 

4.3 Fine-Grained Classification 

We now test our part-predictions in a fine-grained classification setting. These results are 
shown in the right half of Fig. 4. To do this, we train three networks to re-implement the 
three-part framework as Zhang et al. [El] as described in section 3.3. The oracle performance 
refers to the classification assuming ground truth keypoints at test time. While Zhang et al. 
[El] reports an oracle accuracy of 82.0%, we compare with the highest we were able to 
achieve with our implementation: 81.5%. This is likely due to minor differences in network 


£ 

o 

C3 

u 

=3 

O 

O 

< 

6 

o 

hJ 


Figure 2: Localization performance on our vali¬ 
dation set while varying the number of top Edge 
Boxes used. We only rely on the Edge Boxes as 
a means of efficient sampling of the image, so our 
performance is barely affected by the loss of well- 
localized boxes. 
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Figure 3: Examples of good (left) and failed (right) localization results: The ground truth boxes are in 
solid black. The head, torso, and whole body boxes are in green, blue and red respectively. The head 
is correctly localized in most of the above examples. In the top row middle example, even though the 
whole body box IOU is low, most of the missed area is actually background due to the bird extending 
its wings. In the bad examples, we show that we mostly fail in rare close-ups and when there are 
multiple instances. 


training parameters. We also tried both fc6 and fc7 features and found that fc6 performed 
a little better. Although Zhang et al. [El] and Branson et al. [□] noted that their drops in 
accuracy from using ground truth parts to predicted parts were surprisingly small, our relative 
improvements suggest that it is still worthwhile to focus on better localization. Further, we 
perform at least as well as the contemporary Deep LAC model [ED], likely due to our better 
localization of the more discriminative head regions. 

In the left half of Fig. 4, we show how our accuracy is affected from the ground truth 
keypoint ideal case (Oracle) to the use of predicted keypoints (GT Bbox), and finally with 
the GT Bbox removed (No GT Bbox). Unsurprisingly, the better localization at test time 
allows for a significantly smaller drop as annotations are removed. 

The same plot also shows an ablation test of individual parts. It appears that the bulk of 
our performance comes from discriminating localized bird heads. This is also supported by 
[B] which observed that of their learned poses, the one that corresponded to the head was the 
most discriminative. This suggests that most of our improvement over our base method of 
[EH] comes from significantly improving our head part localization (shown in Table 3). 


g 8° 

I 70 

(Z3 
(Z2 

U 60 

□ Oracle 0 GT Bbox 0 No GT Bbox 




Method 

Acc. 

Oracle 

Oracle Parts + SVM 

81.5 


DPD [EO] 

51.0 


Symbiotic [□] 

59.4 


Alignment [□] 

62.7 

GT Bbox 

DeCAF [ED] 

65.0 

POOF [□] 

56.8 


Part-Based RCNN [El] 

76.4 


Deep LAC [ED] 

80.3 


Ours (mult, medoid) 

80.3 


Ours (mult, mdshft) 

80.3 


Pose Norm [□] 

75.7 

No GT Bbox 

Part-Based RCNN [El] 

73.9 

Ours (mult, medoid) 

78.2 


Ours (mult, mdshft) 

78.3 


Figure 4: On the left we show a comparison of classification accuracies obtained using combinations 


of parts localized under different conditions (H: Head, T: Torso, B: Whole Body). On the right, we 


compare our classification accuracy with other works. 
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Outliers Inliers Mean AMedian ★ Medoid # Medoid Shift Ground Truth 


Figure 5: Qualitative results for a subset of the keypoints. Predictions for most of the images clus¬ 
ter tightly. Therefore, simple prediction methods such as medoids work well. Medoid shift adds to 
the robustness, leading to further improvements (second last column). Primary failure mode is when 
visibility thresholding fails to rule out clusters of false positives (bottom right). 

5 Conclusion 

We presented a method for obtaining state-of-the-art keypoint predictions on the CUB dataset. 
We demonstrated that conditioning the predictions on multiple object proposals for suffi¬ 
cient image support can reliably improve localization predictions without using a cascade 
of coarse-to-fine networks. We tackle the problem of fixed-size inputs when using neural 
networks by sampling predictions from several boxes and determining the “peak” of the pre¬ 
dictions with medoids. We then use our predictions to significantly improve state-of-the-art 
methods on the fine-grained classification task on the CUB dataset. In future work, we in¬ 
tend to apply this method to human datasets as well as to combine it with more sophisticated 
inference methods to deal with multiple instances. 
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